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Graphical Gaussian models have proven to be useful tools for ex- 
ploring network structures based on multivariate data. Applications 
to studies of gene expression have generated substantial interest in 
these models, and resulting recent progress includes the development 
of fitting methodology involving penalization of the likelihood func- 
tion. In this paper we advocate the use of multivariate ^distributions 
for more robust inference of graphs. In particular, we demonstrate 
that penalized likelihood inference combined with an application of 
the EM algorithm provides a computationally efficient approach to 
model selection in the t-distribution case. We consider two versions 
of multivariate t-distributions, one of which requires the use of ap- 
proximation techniques. For this distribution, we describe a Markov 
chain Monte Carlo EM algorithm based on a Gibbs sampler as well as 
a simple variational approximation that makes the resulting method 
feasible in large problems. 

1. Introduction. Graphical Gaussian models have attracted a lot of re- 
cent interest. In these models an observed random vector Y = (Y±, . . . ,Y P ) 
is assumed to follow a multivariate normal distribution J\f p (p>,T,), where [i 
is the mean vector and £ the positive definite covariance matrix. Each 
model is associated with an undirected graph G = (V, E) with vertex set 
V = {1, . . . ,p}, and defined by requiring that for each nonedge (j,k) ^ E, 
the variables Yj and Y^ are conditionally independent given all the remain- 
ing variables Y\{j,k}- Here, \{j, k} denotes the complement V\ {j,k}. Such 
pairwise conditional independence holds if and only if S"^ 1 = 0; see Lauritzen 
(1996) for this fact and general background on graphical models. Therefore, 
inferring the graph corresponds to inferring the nonzero elements of X -1 . 



Received December 2009; revised July 2010. 
Supported by NSF Grant DMS-07-46265. 
2 Supported by an Alfred P. Sloan Fellowship. 

Key words and phrases. EM algorithm, graphical model, Markov chain Monte Carlo, 
multivariate t-distribution, penalized likelihood, robust inference. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Applied Statistics. 
2011, Vol. 5, No. 2A, 1057-1080. This reprint differs from the original in 
pagination and typographic detail. 



1 



2 



M. FINEGOLD AND M. DRTON 



Classical solutions to the model selection problem include constraint-ba- 
sed approaches that test the model-defining conditional independence con- 
straints, and score-based searches that optimize a model score over a set of 
graphs. A review of this work can be found in Drton and Perlman (2007). 
Recently, however, penalized likelihood approaches based on the one-norm of 
the concentration matrix S _1 have become increasingly popular. Meinshau- 
sen and Btihlmann 2006 proposed a method that uses lasso regressions of 
each variable Yj on the remaining variables Y\j := Y\{j}- In subsequent work, 
Yuan and Lin (2007) and Banerjee, El Ghaoui and d'Aspremont (2008) dis- 
cuss the computation of the exact solution to the convex optimization prob- 
lem arising from the likelihood penalization. Finally, Friedman, Hastie and 
Tibshirani (2008) developed the graphical lasso (glasso), which is a computa- 
tionally efficient algorithm that maximizes the penalized log-likelihood func- 
tion through coordinate-descent. The theory that accompanies these algo- 
rithmic developments supplies high-dimensional consistency properties un- 
der assumptions of graph sparsity; see, for example, Ravikumar et al. (2009). 

Inference of a graph can be significantly impacted, however, by deviations 
from normality. In particular, contamination of a handful of variables in 
a few experiments can lead to a drastically wrong graph. Applied work thus 
often proceeds by identifying and removing such experiments before data 
analysis, but such outlier screening can become difficult with large data 
sets. More importantly, removing entire experiments as outliers may discard 
useful information from the uncontaminated variables they may contain. 

The existing literature on robust inference in graphical models is fairly 
limited. One line of work concerns constraint-based approaches and adopts 
robustified statistical tests [Kalisch and Buhlmann (2008)]. An approach for 
fitting the model associated with a given graph using a robustified likeli- 
hood function is described in Miyamura and Kano (2006). In some cases 
simple transformations of the data may be effective at minimizing the ef- 
fect of outliers or contaminated data on a small scale. A normal quantile 
transformation, in particular, appears to be effective in many cases. 

In this paper we extend the scope of robust inference by providing a tool 
for robust model selection that can be applied with highly multivariate data. 
We build upon the glasso of Friedman, Hastie and Tibshirani (2008), but 
model the data using multivariate i-distributions. Using the EM algorithm, 
the tlasso methods we propose are only slightly less computationally efficient 
than the glasso but cope rather well with contaminated data. 

The paper is organized as follows. In Section 2 we review maximization of 
the penalized Gaussian log-likelihood function using the glasso. In Section 3 
we introduce the classical multivariate ^distribution and describe maximiza- 
tion of the (unpenalized) log-likelihood using the EM algorithm. In Section 4 
we combine the two techniques into the tlasso to maximize the penalized 
log-likelihood in the multivariate t case. In Section 5 we introduce an alter- 
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native multivariate i-distribution and describe how inference can be done 
using stochastic and variational EM. In Section 6 we compare the glasso 
to our i-based methods on simulated data. Finally, in Section 7 we analyze 
two different gene expression data sets using the competing methods. Our 
findings are summarized in Section 8. 

2. Graphical Gaussian models and the graphical lasso. Suppose we ob- 
serve a sample of n independent random vectors Y\, . . . ,Y n S MP that are 
distributed according to the multivariate normal distribution Af p (fA, £). Like- 
lihood inference about the covariance matrix E is based on the log-likelihood 
function 

£(E) = -^log(2vr) - ^logdet(E) - ^(SIT 1 ), 
where the empirical covariance matrix 
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is defined based on deviations from the sample mean Y . Let = (6jk) — E^ 1 
denote the (p x ^-concentration matrix. In penalized likelihood methods 
a one-norm penalty is added to the log-likelihood function, which effectively 
performs model selection because the resulting estimates of may have 
entries that are exactly zero. Omitting irrelevant factors and constants, we 
are led to the problem of maximizing the function 

(2.1) logdet(e)-tr(Se)-p||6||i 

over the cone of positive definite matrices, where ||0||i is the sum of the 
absolute values of the entries of O. The multiplier p is a positive tuning 
parameter. Larger values of p lead to more entries of O being estimated as 
zero. Cross-validation or information criteria can be used to tune p. 

The glasso is an iterative method for solving the convex optimization 
problem with the objective function in (2.1). Its updates operate on the 
covariance matrix E. In each step one row (and column) of the symmetric 
matrix E is updated based on a partial maximization of (2.1) in which 
all but the considered row (and column) of O are held fixed. This partial 
maximization is solved via coordinate-descent as briefly reviewed next. 

Partition off the last row and column of E = ((Jjk) and S as 

E\ Pi \ p E\ pp \ of \Pi\p \p,p 



V a M> ) ' V S\p, P s pp 

Then, as shown in Banerjee, El Ghaoui and d'Aspremont (2008), partially 
maximizing E\ Pi? , with E\p\ p held fixed yields E\ Pi? , = E\p\ p /3*, where (5* 
minimizes 

||(E V)Ap ) 1 /^_( SVj g-i/2 5w ||2 + p || / 0|| i 
with respect to /3 £ W" 1 . The glasso finds /3* by coordinate descent in each 
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of the coordinates j = 1, . . . ,p — 1, using the updates 

n* _ T ( S 3P ~ Tlk<p,kjtj VkjPjiP) 
Pi i 

where T(x,t) = sgn(x)(|x| — t) + . The algorithm then cycles through the rows 
and columns of £ and S until convergence. The diagonal elements are simply 
o~ pp = s pp + p. See Friedman, Hastie and Tibshirani (2008) for more details 
on the method. 

3. Graphical models based on the t-distribution. 

3.1. Classical multivariate t-distribution. The classical multivariate t- 
distribution t PtU (p,^f) on MP has Lebesgue density 

f( ^ r((^ + p)/2)|^rv 2 

{ ' Ml/,M»*j ( X v¥/2r(u/2)[l + 8y(n,*)/v]i>'W 

with S y (n,^f) = (y - p) T ^~ l {y - fi) and y £ M p . The vector /i G R p and 
the positive definite matrix $ = (^jfc) determine the first two moments of 
the distribution. If Y ~ t Pjl/ (fj,,^f) with ^ > 2 degrees of freedom, then the 
expectation is E[Y] = /x and the covariance matrix is V[Y] = vj{y — 2) • 
From here on we will always assume v > 2 for the covariance matrix to exist. 
For notational convenience and to illustrate the parallels with the Gaussian 
model, we define G = (9jk) = V? . 

If X ~ J\f p (0, is a multivariate normal random vector independent of 
the Gamma-random variable r ~ T(v/2,v/2), then Y = + X/y/r is dis- 
tributed according to t P}U (n, ^f); see Kotz and Nadarajah (2004), Chapter 1. 
This scale-mixture representation, illustrated in Figure 1, allows for easy 
sampling. It also clarifies how the use of ^distributions leads to more robust 
inference because extreme observations can arise from small values of r. An 
additional useful fact is that the conditional distribution of r given Y is 
again a Gamma-distribution, namely, 

(3.2) (r|Y) ^ V + P " + 




Fig. 1. Graph representing the process generating a multivariate t -random vector Y from 
a latent Gaussian random vector X and a single latent Gamma-divisor. 
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Let G = (V, E) be a graph with vertex set V = {1, . . . ,p}. We define the 
associated graphical model for the t-distribution by requiring that 6jk = 
for indices j ^ k corresponding to a nonedge (j, k) ^ E. This mimics the 
Gaussian model in that zero constraints are imposed on the inverse of the 
covariance matrix. However, in a t-distribution this no longer corresponds to 
conditional independence, and the density f u {y, ^) does not factor accord- 
ing to the graph. The conditional dependence manifests itself, in particular, 
in conditional variances in that even if Ojk = 0, 

vk-I^I^vk-I^,*}]. 

For a simple illustration of this inequality, let f be a diagonal matrix. Then 

vrviv l mir*Mv 1 1 iR-r-r-iiv i 1 u + 5Y \M\v^\j\j) 
vK-|y v ] = nxj MY\ 3 ] = Q--nr \y Xj ] = — -^— 3 , 

which can be shown by taking iterated conditional expectations, and using 
that 

nxf\Y\ J ,T]=E[X]\X\ J ,r]=Y[X,\X\ J ] = -L 

"33 

and that r given Y\j has a Gamma-distribution; recall (3.2). Clearly, V[lj|l\j] 
depends on all Yfc, k^= j. 

Despite the lack of conditional independence, the following property still 
holds (proved in the Appendix). 

Proposition 1. Let X ~ W p (0, B" 1 ), where 9jk = for pairs of indices 
j 7^ k that correspond to nonedges in the graph G. Let r be independent of X 
and follow any distribution on the positive real numbers with E[l/r] < oo and 
define Y = pL + X/y/r. Lf two nodes j and k are separated by a set of nodes C 
in G, then Yj and Y^ are conditionally uncorrelated given Yq. 

The edges in the graph indicate the allowed conditional independencies 
in the latent Gaussian vector X. According to Proposition 1, however, we 
may also interpret the graph in terms of the observed variables Yj . The zero 
conditional correlations entail that mean-square error optimal prediction of 
variable Yj can be based on the variables Y^ that correspond to neighbors 
of the node j in the graph, which is a very appealing property. 

3.2. EM algorithm for estimation. The lack of density factorization prop- 
erties complicates likelihood inference with i-distributions. However, the 
EM algorithm provides a way to circumvent this issue. Equipped with the 
normal-Gamma construction, we treat r as a hidden variable and use that 
the conditional distribution of Y given r is jV p (fJ>, ^/t). We now outline the 
EM algorithm for the t-distribution assuming the degrees of freedom v to 
be known. If desired, v could also be estimated in a line search that is best 
based on the actual i-likelihood [Liu and Rubin 1995]. 
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Consider an ra-sample Yi, . . . ,Y n drawn from t PjV (/j,, ^). Let t\, . . . ,T n be 
an associated sequence of hidden Gamma-random variables. Observation of 
the Tj would lead to the following complete-data log-likelihood function for [i 
and 6 = vP -1 : 

4*(M, Q\Y, t) oc I log det(9) - i tr ( G £ r^lf J 
(3-3) n V 1=1 n ) 

8=1 1=1 

where, with some abuse, the symbol oc indicates that irrelevant additive 
constants are omitted. The complete-data sufficient statistics 



S T = ^2 Ti ' StY = ^2 Ti Yi ' StYY = X] TiYiY i 



rT 

1 i 

i=l i=l i=l 



are thus linear in r. We obtain the following EM algorithm for computing 
the maximum likelihood estimates of fi and 

E-step: The E-step is simple because 

(3-4) E[t\Y] = V s + . P — . 

Given current estimates fj,® and VP®, we compute in the (t + l)st iteration 

(t+i) _ 

Ti I/ + (5y(/iW,^W)' 

M-step: Calculate the updated estimates 

(o r\ t+1) _ 2^=1 T i 1 % 

^• Dj ^ _ (<+l) ' 

2^i=i T i 

n 

(3.6) M^ 1 ) = I - - M (4+1) ] T . 

2=1 

4. Penalized inference in t-distribution models. Model selection in graph- 
ical i-models can be performed, in principle, by any of the classical constraint- 
and score-based methods. In score-based searches through the set of all undi- 
rected graphs on p nodes, however, each model would have to be refit using 
an iterative method such as the algorithm from Section 3.2. The penalized 
likelihood approach avoids this problem. 

Like in the Gaussian case, we put a one-norm penalty on the elements 
of and wish to maximize the penalized log-likelihood function 

n 

(4.1) ^,obs(/i,e|Y) = ^log/ 1/ (y 4 ; / u,G- 1 ) - p||e||i, 

1=1 
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where f v is the i-density from (3.1). To achieve this, we will use a modified 
version of the EM algorithm taking into account the one-norm penalty. 

We treat r as missing data. In the E-step of our algorithm, we calculate 
the conditional expectation of the penalized complete-data log-likelihood 

Tl Tl 

(4.2) £ P jM(li,e\Y,T) oc -log|G| - -tr(eS TY Y(M)) - p\\Q\\i 
with 

1 - 

S t yy(jj) = -Y j r i {Y i - fx)(Yi - M ) T 
n ^-^ 

i=l 

Since £ P: hid(fJ-, ®\Y, r) is again linear in r, the E-step takes the same form 
as in Section 3.2. Let iv® and 8" be the estimates after the ith iteration, 
and t\ 1 the conditional expectation of T{ calculated in the (i + l)st E-step. 
Then in the M-step of our algorithm we wish to maximize 

Tl Tl 

- log |9| - - tr(GS T ( t+1)yy (/i)) - p||e||i 

with respect to \i and 0. Differentiation with respect to \x yields 
from (3.5) for any value of 0. Therefore, 0( i+1 ) is found by maximizing 

(4.3) ^l g|e|-^tr(GS r(t+1)yy ( / i( t+1 )))-p||G|| 1 . 

The quantity in (4.3), however, is exactly the objective function maximized 
by the glasso. 

Iterating the E- and M-steps just described, we obtain what we call the 
tlasso algorithm. Since the one-norm penalty forces some elements of Q ex- 
actly to zero, the tlasso performs model selection and parameter estimation 
in a way that is similar to structural EM algorithms [Friedman (1997)]. Con- 
vergence to a stationary point is guaranteed in the penalized version of the 
EM algorithm [McLachlan and Krishnan (1997), Chapter 1.6]; typically a lo- 
cal maximum is found. Note also that the maximized log-likelihood function 
is not concave, and so one finds oneself in the usual situation of not being 
able to give any guarantees about having obtained a global maximum. 

5. Alternative model. 

5.1. Specification of the alternative t-model. The tlasso from Section 4 
performs particularly well when a small fraction of the observations are con- 
taminated (or otherwise extreme). In this case, these observations are down- 
weighted in entirety, and the gain from reducing the effect of contaminated 
nodes outweighs the loss from throwing away good data from other nodes. In 
high-dimensional data sets, however, the contamination, or other deviation 
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Fig. 2. Graph representing the process generating a t* -random vector Y from a latent 
Gaussian random vector X and independent latent Gamma-divisors. 

form normality, may be in small parts of many observations. Downweighting 
entire observations may then no longer achieve the desired results. We will 
demonstrate this later in simulations (see the bottom panel of Figure 4). 

To handle the above situation better, we consider an alternative extension 
of the univariate i-distribution, illustrated in Figure 2. Instead of one divi- 
sor r per p-variate observation, we draw p divisors Tj. For j = 1, . . . ,p, let 
Tj ~ r(i//2, v/2) be independent of each other and of X ~ Ap(0, ^>). We then 
say that the random vector Y with coordinates Yj = [ij + Xj/^/fj follows 
an alternative multivariate i-distribution; in symbols Y ~ t* v {^, ^). 

Unlike for the classical multivariate i-distribution, the covariance ma- 
trix Y[Y] is no longer a constant multiple of = (ijjjk) when Y ~ t* ^). 
Clearly, the coordinate variances are still the same, namely, 

v K'] = ^-^'> 

but the covariance between Yj and Y/% with j ^ k is now 

^r((^-l)/2) 2 v 

2r(^/2)2 ' V>' k - v - 2 ' tjk - 

The same matrix ^ thus implies smaller correlations (by the same constant 
multiple) in the t*-distribution. This reduced dependence is not surprising 
in light of the fact that now different and independent divisors appear in 
the different coordinates. Despite the decrease in marginal correlations, the 
result of Proposition 1 does not hold for conditional correlations in the alter- 
native model. That is, = does not imply Yj and Yfc are conditionally 
uncorrelated given Y^y ^. Interpretation of the graph in the alternative 
model is thus limited to considering edges to represent the allowed condi- 
tional dependencies in the latent multivariate normal distribution. 

The following simulation confirms the result and illustrates the effect. We 
consider a t3 3 (O,0 _1 ) distribution with 

1 -0.5" 
1 -0.5 
-0.5 -0.5 1 
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Fig. 3. Sample correlation of Y\ and Yi for observations with Y3 in a window of size 
0.01. 



and draw independent samples until we have 500,000 observations with x < 
I3 < x + 0.01 for 120 values of x in the range (—6, 6). The sample correlations 
of Y\ and I2 given the varying values of I3 are shown in Figure 3. 

5.2. Alternative tlasso. Inference in the alternative model presents some 
difficulties because the likelihood function is not available explicitly. The 
complete-data log-likelihood function I* hid (/i, 8|Y, r), however, is simply 
the product of the evaluations of p Gamma-densities (r being a vector 
now) and a multivariate normal density. We can thus implement an EM- 
type procedure if we are able to compute the conditional expectation of 
£* hid (n,@\Y,T) given Y = (Yi,...,Y n ). This time we treat the p random 
variables (th, . . . ,Ti p ) as hidden for each observation i = l,...,n. Unfor- 
tunately, the conditional expectation is intractable. It can be estimated, 
however, using Markov Chain Monte Carlo. 

The complete-data log-likelihood function is equal to 

Tl Tl 

(5.1) ^ hid (/i,e,|y,r)oc-log|8| - -tr(es* TYY (»))- P \\e\\i, 

where 

1 - 

i=i 
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and D^y/Ti) is the diagonal matrix with sfri = y^i, . . . , ^Jt^> along the di- 
agonal. The trace in (5.1) is linear in the entries of the matrix ^Jri^/Ti' '. 
A Gibbs sampler for estimating the conditional expectation of this matrix 
given Y cycles through the coordinates indexed by j = 1, . . . ,p and draws, 
in its mth iteration, a number rj^ from the conditional distribution of Ty 
given (Ti\j-,y). This full conditional has density 

(5.2) f{n 3 \T Av Yi) = C(a, /3, 7 ) ■ r^" 1 exp{-njP - 

with 

(5-3) « = ^, /3= " + ( ^7 /ij)2 ^ , " (V- ,',)«, , A', 

and normalizing constant C(a, /3,-y). This constant can always be expressed 
using hyper geometric functions, but, as we detail below, much simpler for- 
mulas can be obtained for the small integer degrees of freedom v that are of 
interest in practice. The simpler formulas are obtained by partial integration. 

From (3 and 7 in (5.3), form the ratio 7' = 7/(2i//3). In order to sample 
from the distribution in (5.2), we may draw from 

(5.4) / a ,y (t) = C(a, 7') • t a ~ l exp{-t - Vt2 7 '} 

and divide the result by f3. For our default of v = 3, that is, a = 2, we thus 
need to sample from 

(5.5) f 1 (t) = C( 1 )-texp{-t- Vt2 7 }. 

Writing for the cumulative distribution function of the standard normal 
distribution, the normalizing constant becomes 

(5.6) 1/C( 7 ) = 1 + 7 2 - 7 (2 7 2 + 3) V^exp{ 7 2 }(l - $( 7 \/2)). 

For 7 = 0, the density / 7 (t) is a T(2, 1) density. For moderate 7, we are thus 
led to the following rejection sampling procedure to draw from / 7 . 

Let gs be the density of a r(2,<5) distribution. Rejection sampling us- 
ing the family of densities gs as instrumental densities proceeds by drawing 
a proposal T~ T(2,5) and a uniform random variable U ^U(0, 1) and ei- 
ther accept if U < f(T)/(M$g$(T)) or repeat the process until acceptance. 
Here, Ms is a suitable multiplier such that f(t) < M$gs(t) for all t > 0. 

An important ingredient to the rejection sampler is the parameter 5, which 
we choose as follows. In the case 7 < 0, the density gs has a heavier tail than / 
provided that 6 < 1. Focusing on the case a = 2, we have that for a given 
5 < 1 the smallest M such that f(t) < Mg s (t) for all t > is 
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Varying 5, the multiplier Ms is minimized at 

4 

If 7 > 0, then setting 5 = 1 yields a heavy enough tail and Ms = C. 

The rejection sampling performs draws from the exact conditional dis- 
tribution f(Tij\Ti\j, Y). We find it works very well for data with not too 
extreme contamination such as, for instance, in the original as well as boot- 
strap data from the application discussed in Section 7.2. When applied to 
data with very extreme observations Yy, however, one is faced with larger 
positive values of 7. In this case the instrumental densities gs provide a poor 
approximation to the target density / 7 , and the acceptance probabilities in 
the rejection sampling step become unpractically low. 

For 7 > 1, we thus use an alternative rejection procedure. Make the trans- 
formation s = \ft. We then wish to sample from 

/i 7 (s) = 2C(7) • s 3 exp{-s 2 - s2 7 }. 

Any T(a, S) distribution has a heavier tail than the target distribution /i 7 (s). 
While it is not possible to find an analytical solution for the optimal a and 5, 
letting a = 1 and 5 = (7 + l)/2 yields acceptance probabilities between 40% 
and 50% for most plausible values of 7. Since this alternative procedure 
will only be needed occasionally, these acceptance problems are adequate. 
Using this hybrid approach yields overall acceptance probabilities greater 
than 98% for the data with extreme contamination described in Section 7.1. 

Returning to the iterations of the overall sampler, we calculate y^y^ 
at the end of each cycle through the p nodes, and then take the average 
over M iterations. This solves the problem of carrying out one E-step, and 
we obtain the following stochastic penalized EM algorithm, which we call 
the Monte Carlo t* -lasso (or t^fj-lasso for short): 

E-step: Given current estimates /x® and compute {^/^^/^ : i T )^ t+1 ^ by 
averaging the matrices obtained in some large number M of Gibbs sampler 
iterations, as described above. 

M-step: Calculate the updated estimates 

(t+i) _ l^i=i T ij I ij 

h ~ V n (t+1) • 

Use these and {^/^^/Ti T ) ( ' t+1 ^ to compute the matrix 5* (t+1)yy to 
be plugged into the trace term in (5.1). Maximize the resulting penalized 
log-likelihood function using the glasso. 
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5.3. Variational approximation. The above Monte Carlo procedure loses 
much of the computational efficiency of the classical tlasso from Section 4, 
however, and can be prohibitively expensive for large p. For large prob- 
lems, we turn instead to variational approximations of the conditional densi- 
ty f(Ti\Yi) of the vector given the observed vector YJ. 

The variational approach proceeds by approximating the conditional den- 
sity f(Tij\Yi) by a factoring distribution. In our context, however, it is easier 
to approximate the joint density f(ri,Y) = f (ri) f (Yi\ri) instead. The first 
term is already in product form because we are assuming the individual di- 
visor Tij to be independent in the model formulation, and the second term 
is the density of the multivariate normal distribution 

AA p (^,z?(i/^)e" 1 D(i/^)). 

We approximate this normal distribution by a member of the set of multi- 
variate normal distributions with diagonal covariance matrix. Application of 
this naive mean field procedure, that is, choosing a distribution by minimiz- 
ing Kullback-Leibler divergence, leads to the approximating distribution 

(5.7) A/^Da/v^e-^l/v^)), 

where is the diagonal matrix with the same diagonal elements as [Wain- 
wright and Jordan (2008), Chapter 5]. Writing q*(Y\r) for the density of 
the distribution in (5.7), our approximation thus has the fully factoring form 
q*. Y-{ T ii^i) = f( T i)l*(Xi\Ti). The resulting conditional distribution also fac- 
tors as 

v 

<f(j i \Y i ) = \[g(T ij \Y ij ), 

i=i 

where g(rij\Yij) is the density of the Gamma-distribution T(aij, /3ij), with 
its parameters corresponding to the quantities a and /3 in (5.3). 

In conclusion, the variational E-step consists of calculating, for each ob- 
servation Yi, the expectations 

Pij L{ a ij)VPij 

and Efy^y^lyj] = E 9 [y^|Ky]E 9 [y / Tfc|Yj/ c ]. These values are then substi- 
tuted into (5.1). The M-step is the same as in the t^ c -lasso. 

The effect of the variational approximation is that the weight for node j 
in observation i is based solely on the squared deviation from the mean, 
(Yij — fij) 2 and the conditional variance For a given deviation from 

the mean, the larger the conditional variance of the node, the smaller the 
weight given to that node in that observation. But unlike in the t^Q-lasso, 
no consideration is given to deviation from the conditional mean of the node 
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in question given the rest. Some relevant information is therefore not being 
used, but in our simulations the effect was not noticeable. 

The resulting variational t* -lasso (t* ar -lasso) is only slightly more expen- 
sive than the tlasso and, despite the relatively crude approximation in the 
variational E-step, performs well compared with the t^Q-lasso. Because of 
this, we will use exclusively the t* ar -lasso when considering the alternative 
model in the simulations in the next section. 

6. Simulation results. 

6.1. Procedure. We used simulated data to compare the three procedures 
glasso, tlasso and t* ar -lasso as follows. We generated a random 100 x 100 
sparse inverse covariance (or dispersion) matrix according to the following 
procedure: 

(a) Choose each lower-triangular element of independently to be —1, 
or 1 with probability 1%, 98% and 1%, respectively. 

(b) For j > k set O^j = 9jk- 

(c) Define 6kk = l + h where h is the number of nonzero elements in the feth 
row of 0. 

The final step ensures a strictly diagonally dominant, and thus positive- 
definite matrix. To strengthen the partial correlations, we reduced the diag- 
onal elements by a common factor. We made this factor as large as possible 
while maintaining positive-definiteness and stability for inversion. For these 
particular matrices, fixing a minimum eigenvalue of 0.6 worked well. 

We then generated n = 50 observations from the A/ioo(0, O" 1 ) distribution 
and ran each of the three procedures with a range of values for the one-norm 
tuning parameter p. To compare how well the competing methods recovered 
the true edges, we drew ROC curves. We ran this whole process 250 times 
and then repeated the entire computation, drawing data from iroo,3(0, _1 ) 
and then iioo3(O>0 _1 ) distributions. 

Simulating from ^distributions produces extreme observations, but a more 
realistic setting might be one in which normal data is contaminated in some 
fashion. For instance, consider broken probes or misread observations in 
a large gene expression microarray. Suppose the contaminated data are not 
so extreme as to be manually screened or otherwise identified as obvious 
outliers. To simulate this phenomenon, we generate normal data as above, 
but randomly contaminated 2% of the values with data generated from in- 
dependent univariate M(p*,0.2) random variables, where pf is equal to 2.5 
times the largest diagonal element of _1 . These contaminated values will be 
similar in magnitude to the 2% tail of the original A/ioo(0, _1 ) distribution 
and therefore difficult to identify. 
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Finally, we would like to compare our developed i-procedures with sim- 
pler approaches to robust inference. There are many ways to obtain robust 
estimates of the covariance matrix, but these usually require n> p. Instead 
we obtain a robust estimate for the marginal covariances and variances using 
the procedure of Kalisch and Biihlmann (2008). Since this is not guaranteed 
to result in a positive definite matrix, we add a constant, c, to the diago- 
nal elements of the matrix, where c is the minimum constant necessary to 
ensure the resulting matrix is nonnegative definite. We then use this robust 
estimate of the covariance matrix as input into the glasso and refer to this 
procedure as the robust glasso. 

6.2. Results. Our tlasso and t* &r -lasso are computationally more expen- 
sive, since they call the glasso at each M-step. But in our simulations, the 
algorithms converge quickly. If we run through multiple increasing values of 
the tuning parameter p for the one-norm penalty, it may take about 15-30 
EM iterations for the initial small value of p, but only 2 or 3 iterations for 
later values, as we can "warm start" at the previous output. But even in the 
initial run, two iterations typically lead to a drastic improvement (in the t 
likelihood) over the glasso. 

The only caveat is that the function being maximized by the tlasso meth- 
ods is not guaranteed to be unimodal. We thus started in several places, and 
let the algorithm run for longer than probably necessary in practice. We 
did not observe drastically different results from different starting places. 
Nonetheless, since we are not guaranteed to find a global maximum, the 
statistical performances of the tlasso and t* aT -lasso may, in principle, be 
understated here (and, of course, the computational efficiency overstated). 

In the worst case scenario for our procedures relative to the glasso — 
when the data is normal and we assume i-distributions with 3 degrees of 
freedom — almost no statistical efficiency is lost. In the numerous simulations 
we have run using normal data, the tlasso and glasso do an essentially equally 
good job of recovering the true graph (see Figure 4). The t* ar -lasso performs 
surprisingly well at small to moderate false discovery rates. The robust glasso 
is based on a less efficient estimator and does not perform as well as the other 
procedures. 

For data generated from a classical t-distribution with 3 degrees of free- 
dom, the tlasso provides drastic improvement over the glasso at the low false 
positive rates that are of practical interest. The assumed normality and the 
occasional extreme observation lead to numerous false positives when us- 
ing the glasso. Therefore, there is very little computational — and little or 
no statistical — downside to assuming t-distributions, but significant statis- 
tical upside. Interestingly, the t* ai -lasso performs about as well as the tlasso. 
The robust glasso outperforms the purely Gaussian procedure at low false 
positive rates, since it is less susceptible to the most extreme observations. 
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Fig. 4. ROC curves depicting the performances of the four methods under four different 
types of data. Each curve is an average over 250 simulations. 



In the third case, with data generated from the alternative t-distribution 
with 3 degrees of freedom, only the t* &r -lasso is able to recover useful in- 
formation without substantial noise. The occasional large values are too 
extreme for the normal model to explain and downweighting entire obser- 
vations, as is done by the tlasso, discards too much information when there 
are extreme values scattered throughout the data. The robust glasso offers 
only a small improvement over the glasso. 

With the contaminated data, the t* &r -lasso does not perform as well in 
this case as it does with t* data. The extreme values are not downweighted as 
much and, thus, the signals are noisier. It still performs far better, however, 
than either of the other methods, and is able to recover valuable information 
in a case where manual screening of outliers would be very difficult. The 
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robust glasso does not perform as well as the t*^ T -lasso, but offers a clear 
improvement over the glasso and might be a useful alternative. 

6.3. Notes on simulation. The simulations show that the tlasso performs 
very similarly to the glasso even with normal data. While one would expect 
a model based on the i-distribution to fare better with normal data than 
a normal model would with t data, the fact that there is almost no statistical 
loss from the model misspecification is at first a bit surprising. The similarity 
of the results can be explained, however, by comparing the two procedures. 
In effect, the only difference is that the tlasso inputs a weighted sample 
covariance matrix into the glasso procedure; one can then think of the glasso 
as the tlasso with all weights set to one. 

As noted in Section 3.2, these weights are the conditional expectations 
of r, which are, from equation (3.4), 

where v is our estimate or assumption of the unknown degrees of freedom. If 
Y ~ t PtU (n, \E r ) and v > 4, then 5y (a*, jp is distributed according to the T p>v 
distribution [Kotz and Nadarajah (2004), Chapter 3]. Thus, starting with 
the true values of /i and ip, the variance of the inverse weights is 



V 



v +p 



2pu 2 (p + v 



(v-2) 2 (v-A)(v + p) 2 ' 

For normal data (i.e., v = oo), the variance is 2p/(p +p) 2 and goes to very 
quickly as p gets large, no matter the assumed value of v. If our current 
estimate of is reasonably close to the true 0, then the observations will 
likely have very similar weights and the weighted covariance matrix will be 
very close to the sample covariance matrix. For t data, the above variance 
tends to 2v 2 j(v - 2) 2 (v - 4) for large p; so no matter how many variables 
we have, the distribution of the inverse weights will have positive variance 
and the tlasso and glasso estimates are less likely to agree. 



7. Gene expression data. 



7.1. Galactose utilization. We consider data from microarray experi- 
ments with yeast strands [Gasch et al. (2000)]. As in Drton and Richardson 
(2008), we limit this illustration to 8 genes involved in galactose utilization. 
An assumption of normality is brought into question, in particular, by the 
fact that in 11 out of 136 experiments with data for all 8 genes, the mea- 
surements for 4 of the genes were abnormally large negative values. In order 
to assess the impact of this handful of outliers, we run each algorithm, ad- 
justing the penalty term p such that a graph with a given number of edges is 
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Fig. 5. Top 9 recovered edges: (a) glasso, (b) tlasso, (c) t^Q-lasso. Dashed edges were 
recovered only when including the outliers; dotted only when excluding them; solid in both 
cases. 

inferred. Somewhat arbitrarily we focus on the top 9 edges. We do this once 
with all 136 experiments and then again excluding the 11 potential outliers. 

As seen in Figure 5, the glasso infers very different graphs, with only 3 
edges in common. When the "outliers" are included, the glasso estimate 
in Figure 5(a) has the 4 nodes in question fully connected; when they are 
excluded, no edges among the 4 nodes are inferred. The tlasso does not 
exhibit this extreme behavior. As seen in Figure 5(b), it recovers almost the 
same graph in each case (7 out of 9 edges shared). When run with all the 
data, the r estimate is very small (~0.04) for each of the 11 questionable 
observations compared with the average r estimate of 1.2. The graph in 
Figure 5(c) shows the results from the t^Q-lasso which performs just as 
well as the tlasso. The t* ai .-lasso also recovered 7 edges in both graphs (not 
shown) and infers relationships similar to those found by the t^Q-lasso. 

Figure 6 illustrates the flexibility of the weighting schemes of the var- 
ious procedures. Both t* procedures downweight the 11 potential outliers 
observations for the 4 nodes in question, but not for the other nodes. Thus, 
the alternative version is able to extract information from the "uncontam- 
inated" part of the 11 observations while downweighting the rest. In this 
particular case, with 125 other observations, downweighting the outliers is 
of primary importance, and, thus, the increased flexibility of the t^Q-lasso 
over the tlasso does not make much of a difference in the inferred graphs. 
This might not be the case with a higher contamination level. 
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tlasso Gene expression data t* MC -lasso t* var -lasso 

Fig. 6. From left to right, inverse weights from tlasso, followed by normalized gene ex- 
pression data, and inverse weights from t^Q-lasso and t^ -lasso. Rows correspond to genes 
and columns to observations. Lighter shades indicate larger values. The tlasso uses only 
one weight per observation and so must weight each gene the same. All plots show the 
same subset of data including 11 potential outliers. 

7.2. Isoprenoid pathway. We next consider gene expression data for the 
isoprenoid pathways of Arabidopsis thaliana discussed in Wille et al. (2004). 
Gene expressions were measured in 118 Affymetrix microarrays for 39 genes. 
While the data set described in the above section had clear deviations from 
normality, the data described in this section has no obvious deviations that 
stand out in exploratory plots. 

Two approaches were considered in Wille et al. (2004). The first (GGM1) 
fit a Gaussian graphical model using BIC and backward selection to obtain 
a network with 178 edges. This number was deemed too large for interpre- 
tation, and the authors considered instead only the 31 edges found in at 
least 80% of bootstrapped samples. The second approach [GGM2) tests the 
conditional independence of each pair of genes given a third gene. An edge 
is drawn only if a test of conditional independence is rejected for each other 



ROBUST GRAPHICAL MODELING USING T-DISTRIBUTIONS 



19 



gene in the network. This approach is advocated in the paper and appears 
to find a network with better biological interpretation. The graph is shown 
in Figure 7, where shaded nodes indicate the so-called MEP pathway. 

Our approach is modeled after GGM1. We used the t* ar -lasso and increas- 
ing values or p to find a path of models to test. For each chosen model, we 
ran the t* &r -lasso again, but this time without penalty on the allowed edges. 
Since the t* likelihood is unavailable, we use leave-one-out cross-validation 
to find the model with the lowest mean squared prediction error. Since the 
exact conditionals from the alternative distribution are not available in ex- 
plicit form, we perform the cross-validation as follows: 

(a) Estimate using all but one observation. 

(b) In the remaining observation, estimate the values of the latent normal 
variables for all but one of the coordinates in the same manner as the 
variational E-step of Section 5.3. 



DXPSl 




DXPS2 




DXPS3 




Fig. 7. A reproduction of the graph produced by Wille et al. Solid undirected edges are 
those found by the model selection procedure; dotted arrows show the metabolic pathway. 
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(c) Predict the remaining normal value. 

(d) Scale the normal value by the expectation of 1/y/r. 

We remark that we also experimented with leaving out a larger fraction of 
the observations as suggested in the work of Shao (1993), but this led to 
similar conclusions in the present example. 

The cross-validation procedure gave a network with 122 edges. To reduce 
to the graph size found by GGM2, we took 500 bootstrapped samples of the 
data, fixing the parameter p found in cross-validation, and only included 
those edges found in more than 98.5% of the samples. For comparison, we 
also ran the above procedure using the glasso, but keeping 98% of the sam- 
ples to obtain the same-sized graph. 

We believe our procedure infers a graph that compares favorably (in terms 
of biological interpretation) with that found by GGM2. Like GGM2, we find 
a connection between AACT2 and the group MK, MPDC1 and FPPS2; 
GGM1 found AACT2 to be disconnected from the rest of the graph de- 
spite its high correlation with these three genes. In the MEP pathway, our 
approach and GGM2 find similar structure; compare Figures 7 and 8. 

While our approach finds the key relationships identified in Wille et al., it 
achieves this with fewer "cross-talk" edges between the two pathways. The 
authors discuss plausible interpretations for such interactions between the 
pathways, but a graph with less cross-talk might be closer to the scientists' 
original expectation (Figures 7 and 8). It is worth noting that the glasso 
procedure performs better than GGM1, with edge inclusion being far less 
sensitive to the particular bootstrapped sample. The glasso also finds the 
key relationships of GGM2. We also ran the tlasso, which gave results similar 
to the glasso and with the t^Q-lasso, which behaved similar to the t* ar -lasso. 
We do not show these results here. 

8. Discussion. Our proposed tlasso and t^-lasso algorithms are simple 
and effective methods for robust inference in graphical models. Only slightly 
more computationally expensive than the glasso, they can offer great gains 
in statistical efficiency. The alternative t distribution is more flexible than 
the classical t and is generally preferred. We find that the simple variational 
E-step is an efficient way to estimate the graph in the alternative case, but 
also explored more sophisticated Monte Carlo approximations. 

We assumed v = 3 degrees of freedom in our various tlasso and t* -lasso 
runs. As suggested in prior work on i-distribution models, estimation of the 
degrees of freedom can be done efficiently by a line search based on the 
observed log-likelihood function in the classical model. 

In the alternative model, the choice of v puts an explicit upper bound on 
the maximum correlation between two variables, the upper bound increasing 
quickly with v (see Figure 9). This makes inference of the degrees of freedom 
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Fig. 8. Graphs recovered by bootstrapping procedure with target graph size of 43 using (a) 
the glasso and (b) t^-lasso. The graph shows the key relationships identified previously, 
but with fewer "cross-talk" edges. 



22 



M. FINEGOLD AND M. DRTON 

Pairwise Correlations in the Alternative t Model 
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Fig. 9. The alternative t model places an upper bound on the correlation between two 
variables. This bound increases with v , but is fairly restrictive for the small degrees of 
freedom we consider. 

potentially more relevant than with the classical model, as an alternative 
model with small v might not be a good fit for highly correlated variables. 
In order to select v, a line search based on the hidden log- likelihood function 
can be employed. For further flexibility, we may also allow the degrees of 
freedom to vary with each node. That is, we could let the divisors tj ~ 
r(z/j/2, Vj/2) be independent T-divisors with possible different degrees of 
freedom Vj. This leads to similar conditionals in the Gibbs sampler and 
the resulting procedure is thus no more complicated. Nevertheless, for the 
purposes of graph estimation, our experience and related literature suggest 
that not much is lost by considering only a few small values for the degrees of 
freedom. For instance, running the t* ar -lasso procedure in Section 7.2 using 
v = 5 produces a very similar result with one additional cross-talk edge. 

In the last section we used cross-validation to choose the one-norm tuning 
parameter p. The likelihood is not available explicitly for the t* -distribution 
and so we cannot easily use information criteria for the t* -lasso. Cross- 
validation often tends to pick more edges than is desirable, however, when 
the goal is inference of the graph and not optimal prediction. An interesting 
but potentially difficult problem for future research would be to develop rules 
for choosing p that control an edge inclusion error rate; compare Banerjee, 
El Ghaoui and d'Aspremont (2008); Meinshausen and Buhlmann (2006). 

Throughout the paper, we have penalized all the elements of 0. One 
alternative is to remove the penalty from the diagonal elements of 0, since 
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we expect all these to be nonzero. This leads to smaller estimated partial 
correlations, and we found it to result in less stable behavior of the tlasso in 
the sense of the number of edges decreasing rather suddenly as p increases. 

Finally, we remark that other normal scale-mixture models could be treated 
in a similar fashion as the t-distribution models we considered in this paper. 
However, the use of t-distributions is particularly convenient in that it is 
rather robust to various types of misspecification, involves only the choice of 
the degrees of freedom parameters for the distribution of Gamma-divisors, 
and maintains good efficiency when data are Gaussian. 

APPENDIX 

Proof of Proposition 1 . According to standard graphical model the- 
ory [Lauritzen (1996)], it suffices to show that Yj and Y k are conditionally 
uncorrelated given ^V\{j,fc}- Partition V into a = {j, k] and b = V \ {j, k}. 
For a given value of r, 

(Y a \Y b , r) ~ N 2 {p a - @-i a e a>b (Y b - p b ), 6-l/r) 

and 

(Yj\Ykub,r) ~ N(m - 6jJe jjkub (Y kub - UkubWjf/r). 
Since 6j k = 0, 

EK'l^ub, r] = fij - ej/& j>b (Y b - to) = E[Y 3 \Y b , r] 
for any value of r. Therefore, 

E[Yj\Y kub ] =EpE[y-|y fcU 6,r]|y fc u6] =EpE[59|y 6 ,r]|y 6 ] =Wj\Yb], 
which implies that Yj and Y k are conditionally uncorrelated given Y b . □ 
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